Towards Optimal One Pass Large Scale Learning with Averaged Stochastic Gradient Descent
Author
Abstract
For large scale learning problems, it is desirable to obtain the optimal model parameters by going through the data in only one pass. Polyak and Juditsky (1992) showed that, asymptotically, the test performance of the simple average of the parameters obtained by stochastic gradient descent (SGD) is as good as that of the parameters that minimize the empirical cost. However, to our knowledge, despite its optimal asymptotic convergence rate, averaged SGD (ASGD) has received little attention in recent research on large scale learning. One possible reason is that it may take a prohibitively large number of training samples for ASGD to reach its asymptotic region on most real problems. In this paper, we present a finite sample analysis for the method of Polyak and Juditsky (1992). Our analysis shows that, with an improperly chosen learning rate, it indeed usually takes a huge number of samples for ASGD to reach its asymptotic region. More importantly, based on our analysis, we propose a simple way to set the learning rate so that ASGD reaches its asymptotic region after a reasonable amount of data. We compare ASGD using the proposed learning rate with other well known algorithms for training large scale linear classifiers. The experiments clearly show the superiority of ASGD.
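To make the averaging step concrete, below is a minimal sketch of one-pass ASGD for binary logistic regression. The decaying schedule eta_t = eta0 / (1 + a*eta0*t)^c and its constants are illustrative assumptions, not necessarily the exact schedule proposed in the paper; what the abstract describes is that the returned model is the Polyak-Ruppert average of the SGD iterates.

```python
import numpy as np

def asgd_logistic(X, y, eta0=1.0, a=1.0, c=0.75):
    """One pass of averaged SGD on the binary logistic loss.

    Labels y are in {-1, +1}. The decaying rate
    eta_t = eta0 / (1 + a * eta0 * t) ** c is an illustrative choice;
    the returned model is the running (Polyak-Ruppert) average of the
    SGD iterates.
    """
    n, d = X.shape
    w = np.zeros(d)        # current SGD iterate
    w_bar = np.zeros(d)    # averaged iterate, used as the final model
    for t in range(n):
        x_t, y_t = X[t], y[t]
        eta = eta0 / (1.0 + a * eta0 * t) ** c
        margin = y_t * np.dot(w, x_t)
        grad = -y_t * x_t / (1.0 + np.exp(margin))  # gradient of log(1 + e^{-margin})
        w -= eta * grad
        w_bar += (w - w_bar) / (t + 1)              # incremental average of iterates
    return w_bar
```

For example, `asgd_logistic(X, y)` with X an (n, d) array and y in {-1, +1} returns the averaged weight vector after a single pass over the data.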
Similar References
Stochastic Optimization for Large-scale Optimal Transport
Optimal transport (OT) defines a powerful framework to compare probability distributions in a geometrically faithful way. However, the practical impact of OT is still limited because of its computational burden. We propose a new class of stochastic optimization algorithms to cope with large-scale OT problems. These methods can handle arbitrary distributions (either discrete or continuous) as lo...
Linear Learning with Sparse Data
Linear predictors are especially useful when the data is high-dimensional and sparse. One of the standard techniques used to train a linear predictor is the Averaged Stochastic Gradient Descent (ASGD) algorithm. We present an efficient implementation of ASGD that avoids dense vector operations. We also describe a translation invariant extension called Centered Averaged Stochastic Gradient Desce...
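For intuition, here is a minimal sketch, under assumptions of my own, of the common timestamp ("lazy update") trick for maintaining the averaged iterate on sparse data without dense vector operations; it is not necessarily the implementation described in that paper. The running sum of iterates is caught up coordinate by coordinate only when a feature actually appears in an example.

```python
import numpy as np

def asgd_sparse_logistic(examples, n_features, eta0=1.0, a=1.0, c=0.75):
    """Averaged SGD on sparse data without dense per-step averaging.

    `examples` yields (idx, val, label) triples: idx holds the indices of
    the non-zero features, val their values, label is in {-1, +1}. The sum
    of iterates is maintained lazily per coordinate (timestamp trick), so
    each step only touches the features present in the current example.
    """
    w = np.zeros(n_features)                  # current SGD iterate
    w_sum = np.zeros(n_features)              # lazily maintained sum of iterates
    last = np.zeros(n_features, dtype=int)    # step up to which w_sum is current
    t = 0
    for idx, val, label in examples:
        t += 1
        idx = np.asarray(idx)
        val = np.asarray(val, dtype=float)
        # catch up: w[idx] has been constant since step last[idx]
        w_sum[idx] += (t - 1 - last[idx]) * w[idx]
        eta = eta0 / (1.0 + a * eta0 * t) ** c
        margin = label * np.dot(w[idx], val)
        g = -label / (1.0 + np.exp(margin))   # scalar gradient factor
        w[idx] -= eta * g * val               # sparse SGD step
        w_sum[idx] += w[idx]                  # contribution of step t
        last[idx] = t
    w_sum += (t - last) * w                   # final catch-up for all coordinates
    return w_sum / max(t, 1)                  # averaged iterate
```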
Towards stability and optimality in stochastic gradient descent
Lemmas 1, 2, 3 and 4, and Corollary 1, were originally derived by Toulis and Airoldi (2014). These intermediate results (and Theorem 1) provide the necessary foundation to derive Lemma 5 (only in this supplement) and Theorem 2 on the asymptotic optimality of θ̄n, which is the key result of the main paper. We fully state these intermediate results here for convenience but we point the reader to t...
A Bayesian Perspective on Generalization and Stochastic Gradient Descent
We consider two questions at the heart of machine learning: how can we predict whether a minimum will generalize to the test set, and why does stochastic gradient descent find minima that generalize well? Our work responds to Zhang et al. (2016), who showed deep neural networks can easily memorize randomly labeled training data, despite generalizing well on real labels of the same inputs. We show th...
True Asymptotic Natural Gradient Optimization
We introduce a simple algorithm, True Asymptotic Natural Gradient Optimization (TANGO), that converges to a true natural gradient descent in the limit of small learning rates, without explicit Fisher matrix estimation. For quadratic models the algorithm is also an instance of averaged stochastic gradient, where the parameter is a moving average of a “fast”, constant-rate gradient descent. TANGO...
Journal: CoRR
Volume: abs/1107.2490
Pages: -
Publication year: 2011